{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center>RegEx in Python</center>\n",
    "\n",
    "![](images/memes/meme5.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Character Classes\n",
    "\n",
    "- The **character classes** (also known as **character sets**) allow us to define a character that will match if any of the defined characters on the set is present.\n",
    "\n",
    "\n",
    "- To define a character class, we should use the opening square bracket metacharacter `[`, then any accepted characters, and finally close with a closing square bracket `]`.\n",
    "\n",
    "### Example 1\n",
    "\n",
    "Consider an example below where we have messed up between `license` and `licence` spellings and want to find all occurances of `license`/`licence` in the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from utils import highlight_regex_matches"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt = \"\"\"\n",
    "Yesterday, I was driving my car without a driving licence. The traffic police stopped me and asked me for my \n",
    "license. I told them that I forgot my licence at home. \n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"licen[cs]e\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['licence', 'license', 'licence']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.findall(txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Yesterday, I was driving my car without a driving \u001b[43m\u001b[1mlicence\u001b[0m. The traffic police stopped me and asked me for my \n",
      "\u001b[43m\u001b[1mlicense\u001b[0m. I told them that I forgot my \u001b[43m\u001b[1mlicence\u001b[0m at home. \n",
      "\n"
     ]
    }
   ],
   "source": [
    "highlight_regex_matches(pattern, txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/example2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Character Set Range\n",
    "\n",
    "> It is possible to also use the range of a character. This is done by leveraging the hyphen symbol (-) between two related characters; for example, to match any lowercase letter we can use `[a-z]`. Likewise, to match any single digit we can define the character set `[0-9]`.\n",
    "\n",
    "Let us consider an example in which we want to retrieve all the years from the given text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt = \"\"\"\n",
    "The first season of Indian Premiere League (IPL) was played in 2008. \n",
    "The second season was played in 2009 in South Africa. \n",
    "Last season was played in 2018 and won by Chennai Super Kings (CSK).\n",
    "CSK won the title in 2010 and 2011 as well.\n",
    "Mumbai Indians (MI) has also won the title 3 times in 2013, 2015 and 2017.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"[1-9][0-9][0-9][0-9]\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.findall(txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "The first season of Indian Premiere League (IPL) was played in \u001b[43m\u001b[1m2008\u001b[0m. \n",
      "The second season was played in \u001b[43m\u001b[1m2009\u001b[0m in South Africa. \n",
      "Last season was played in \u001b[43m\u001b[1m2018\u001b[0m and won by Chennai Super Kings (CSK).\n",
      "CSK won the title in \u001b[43m\u001b[1m2010\u001b[0m and \u001b[43m\u001b[1m2011\u001b[0m as well.\n",
      "Mumbai Indians (MI) has also won the title 3 times in \u001b[43m\u001b[1m2013\u001b[0m, \u001b[43m\u001b[1m2015\u001b[0m and \u001b[43m\u001b[1m2017\u001b[0m.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "highlight_regex_matches(pattern, txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> There is another possibility—the negation of ranges. We can invert the meaning\n",
    "of a character set by placing a caret (`^`) symbol right after the opening square\n",
    "bracket metacharacter (`[`).\n",
    "\n",
    "For example, to find all the characters used in a text except vowels, we can use the pattern:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"[^aeiou]\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['\\n',\n",
       " 'T',\n",
       " 'h',\n",
       " ' ',\n",
       " 'f',\n",
       " 'r',\n",
       " 's',\n",
       " 't',\n",
       " ' ',\n",
       " 's',\n",
       " 's',\n",
       " 'n',\n",
       " ' ',\n",
       " 'f',\n",
       " ' ',\n",
       " 'I',\n",
       " 'n',\n",
       " 'd',\n",
       " 'n',\n",
       " ' ',\n",
       " 'P',\n",
       " 'r',\n",
       " 'm',\n",
       " 'r',\n",
       " ' ',\n",
       " 'L',\n",
       " 'g',\n",
       " ' ',\n",
       " '(',\n",
       " 'I',\n",
       " 'P',\n",
       " 'L',\n",
       " ')',\n",
       " ' ',\n",
       " 'w',\n",
       " 's',\n",
       " ' ',\n",
       " 'p',\n",
       " 'l',\n",
       " 'y',\n",
       " 'd',\n",
       " ' ',\n",
       " 'n',\n",
       " ' ',\n",
       " '2',\n",
       " '0',\n",
       " '0',\n",
       " '8',\n",
       " '.',\n",
       " ' ',\n",
       " '\\n',\n",
       " 'T',\n",
       " 'h',\n",
       " ' ',\n",
       " 's',\n",
       " 'c',\n",
       " 'n',\n",
       " 'd',\n",
       " ' ',\n",
       " 's',\n",
       " 's',\n",
       " 'n',\n",
       " ' ',\n",
       " 'w',\n",
       " 's',\n",
       " ' ',\n",
       " 'p',\n",
       " 'l',\n",
       " 'y',\n",
       " 'd',\n",
       " ' ',\n",
       " 'n',\n",
       " ' ',\n",
       " '2',\n",
       " '0',\n",
       " '0',\n",
       " '9',\n",
       " ' ',\n",
       " 'n',\n",
       " ' ',\n",
       " 'S',\n",
       " 't',\n",
       " 'h',\n",
       " ' ',\n",
       " 'A',\n",
       " 'f',\n",
       " 'r',\n",
       " 'c',\n",
       " '.',\n",
       " ' ',\n",
       " '\\n',\n",
       " 'L',\n",
       " 's',\n",
       " 't',\n",
       " ' ',\n",
       " 's',\n",
       " 's',\n",
       " 'n',\n",
       " ' ',\n",
       " 'w',\n",
       " 's',\n",
       " ' ',\n",
       " 'p',\n",
       " 'l',\n",
       " 'y',\n",
       " 'd',\n",
       " ' ',\n",
       " 'n',\n",
       " ' ',\n",
       " '2',\n",
       " '0',\n",
       " '1',\n",
       " '8',\n",
       " ' ',\n",
       " 'n',\n",
       " 'd',\n",
       " ' ',\n",
       " 'w',\n",
       " 'n',\n",
       " ' ',\n",
       " 'b',\n",
       " 'y',\n",
       " ' ',\n",
       " 'C',\n",
       " 'h',\n",
       " 'n',\n",
       " 'n',\n",
       " ' ',\n",
       " 'S',\n",
       " 'p',\n",
       " 'r',\n",
       " ' ',\n",
       " 'K',\n",
       " 'n',\n",
       " 'g',\n",
       " 's',\n",
       " ' ',\n",
       " '(',\n",
       " 'C',\n",
       " 'S',\n",
       " 'K',\n",
       " ')',\n",
       " '.',\n",
       " '\\n',\n",
       " 'C',\n",
       " 'S',\n",
       " 'K',\n",
       " ' ',\n",
       " 'w',\n",
       " 'n',\n",
       " ' ',\n",
       " 't',\n",
       " 'h',\n",
       " ' ',\n",
       " 't',\n",
       " 't',\n",
       " 'l',\n",
       " ' ',\n",
       " 'n',\n",
       " ' ',\n",
       " '2',\n",
       " '0',\n",
       " '1',\n",
       " '0',\n",
       " ' ',\n",
       " 'n',\n",
       " 'd',\n",
       " ' ',\n",
       " '2',\n",
       " '0',\n",
       " '1',\n",
       " '1',\n",
       " ' ',\n",
       " 's',\n",
       " ' ',\n",
       " 'w',\n",
       " 'l',\n",
       " 'l',\n",
       " '.',\n",
       " '\\n',\n",
       " 'M',\n",
       " 'm',\n",
       " 'b',\n",
       " ' ',\n",
       " 'I',\n",
       " 'n',\n",
       " 'd',\n",
       " 'n',\n",
       " 's',\n",
       " ' ',\n",
       " '(',\n",
       " 'M',\n",
       " 'I',\n",
       " ')',\n",
       " ' ',\n",
       " 'h',\n",
       " 's',\n",
       " ' ',\n",
       " 'l',\n",
       " 's',\n",
       " ' ',\n",
       " 'w',\n",
       " 'n',\n",
       " ' ',\n",
       " 't',\n",
       " 'h',\n",
       " ' ',\n",
       " 't',\n",
       " 't',\n",
       " 'l',\n",
       " ' ',\n",
       " '3',\n",
       " ' ',\n",
       " 't',\n",
       " 'm',\n",
       " 's',\n",
       " ' ',\n",
       " 'n',\n",
       " ' ',\n",
       " '2',\n",
       " '0',\n",
       " '1',\n",
       " '3',\n",
       " ',',\n",
       " ' ',\n",
       " '2',\n",
       " '0',\n",
       " '1',\n",
       " '5',\n",
       " ' ',\n",
       " 'n',\n",
       " 'd',\n",
       " ' ',\n",
       " '2',\n",
       " '0',\n",
       " '1',\n",
       " '7',\n",
       " '.',\n",
       " '\\n']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.findall(txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Predefined Character Classes\n",
    "\n",
    "There exist some predefined character classes which can be used as a shortcut for some frequently used classes.\n",
    "\n",
    "\n",
    "<table style=\"border: 1px solid black; font-size:15px;\">\n",
    "<thead>\n",
    "    <th>Element</th>\n",
    "    <th>Description</th>\n",
    "</thead>\n",
    "    \n",
    "<tbody>\n",
    "<tr>\n",
    "    <td>.</td>\n",
    "    <td>This element matches any character except newline</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "    <td>\\d</td>\n",
    "    <td>This matches any decimal digit; this is equivalent to the class [0-9]</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "    <td>\\D</td>\n",
    "    <td>This matches any non-digit character; this is equivalent to the class [^0-9]</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "    <td>\\s</td>\n",
    "    <td>This matches any whitespace character; this is equivalent to the class\n",
    "[ \\t\\n\\r\\f\\v]</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "    <td>\\S</td>\n",
    "    <td>This matches any non-whitespace character; this is equivalent to the class\n",
    "[^ \\t\\n\\r\\f\\v]</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "    <td>\\w</td>\n",
    "    <td>This matches any alphanumeric character; this is equivalent to the class\n",
    "[a-zA-Z0-9_]</td>\n",
    "</tr>\n",
    "    \n",
    "<tr>\n",
    "    <td>\\W</td>\n",
    "    <td>This matches any non-alphanumeric character; this is equivalent to the\n",
    "class [^a-zA-Z0-9_]</td>\n",
    "</tr>\n",
    "</tbody>\n",
    "</table>\n",
    "\n",
    "\n",
    "Now, we can improve our pattern to find years in a given text a bit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"[1-9]\\d\\d\\d\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.findall(txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us try to find out all special symbols (non-alphanumeric, non-whitespace characters) in our text now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['(', ')', '.', '.', '(', ')', '.', '.', '(', ')', ',', '.']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(\"[^\\w\\s]\", txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/memes/meme6.jpg)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
